str(df)
## 'data.frame': 49 obs. of 10 variables:
## $ infected : int 6801 15597 9455 16705 49906 97100 56714 18435 82877 7285 ...
## $ Population..millions. : int 25448054 89977564 164403451 9449847 11580989 212298492 37684705 19087731 1438373881 50788168 ...
## $ Density..Km. : int 3 109 1265 47 383 25 4 26 153 46 ...
## $ median.age : num 37.9 43.5 27.6 40.3 41.9 33.5 41.1 35.3 38.4 31.3 ...
## $ beds...1000 : num 3.8 7.6 0.8 11 6.2 2.2 2.7 2.2 4.2 1.5 ...
## $ GPP...global.pandemic.preparedness: num 75.5 58.5 35 35.3 61 59.7 75.3 58.3 48.2 44.2 ...
## $ Income : Factor w/ 3 levels "high","lower/middle",..: 1 1 2 3 1 3 1 1 3 3 ...
## $ FPY..flights.per.year : int 665384 130260 101383 31676 140674 832683 1475063 134661 4692008 358847 ...
## $ recovered : int 5814 13180 177 3117 12211 40937 23814 9572 78586 1666 ...
## $ deaths : int 94 596 175 97 7765 6761 3684 247 4637 324 ...
## infected Population..millions. Density..Km. median.age
## Min. : 6193 Min. :2.872e+06 Min. : 3 Min. :22.80
## 1st Qu.: 9523 1st Qu.:1.020e+07 1st Qu.: 49 1st Qu.:30.50
## Median : 18435 Median :3.768e+07 Median : 109 Median :38.30
## Mean : 68960 Mean :1.177e+08 Mean : 332 Mean :36.54
## 3rd Qu.: 49906 3rd Qu.:8.998e+07 3rd Qu.: 225 3rd Qu.:42.20
## Max. :1165868 Max. :1.438e+09 Max. :8358 Max. :48.40
## beds...1000 GPP...global.pandemic.preparedness Income
## Min. : 0.600 Min. :35.00 high :28
## 1st Qu.: 1.600 1st Qu.:46.50 lower/middle: 6
## Median : 2.800 Median :55.40 upper/middle:15
## Mean : 3.718 Mean :55.11
## 3rd Qu.: 4.700 3rd Qu.:62.20
## Max. :13.400 Max. :83.50
## FPY..flights.per.year recovered deaths
## Min. : 1421 Min. : 32 Min. : 12
## 1st Qu.: 119148 1st Qu.: 1534 1st Qu.: 229
## Median : 254064 Median : 4326 Median : 664
## Mean : 716587 Mean : 21082 Mean : 4891
## 3rd Qu.: 772926 3rd Qu.: 13386 3rd Qu.: 3336
## Max. :11354693 Max. :175382 Max. :66369
| Country | Number of cases |
|---|---|
| US | 1165868 |
| Spain | 247122 |
| Italy | 209328 |
| United Kingdom | 182260 |
| France | 168396 |
| Country | Cases per total population (%) |
|---|---|
| Qatar | 0.5414 |
| Spain | 0.5286 |
| Belgium | 0.4309 |
| Ireland | 0.4297 |
| US | 0.3526 |
| Country | Number of recoveries |
|---|---|
| US | 175382 |
| Germany | 129000 |
| Spain | 117248 |
| Italy | 79914 |
| China | 78586 |
| Country | Recoveries per confirmed cases (%) |
|---|---|
| China | 94.8224 |
| Australia | 85.4874 |
| Austria | 84.5034 |
| Switzerland | 80.9229 |
| Iran | 79.3952 |
| Country | Number of deaths |
|---|---|
| US | 66369 |
| Italy | 28710 |
| United Kingdom | 28205 |
| Spain | 25100 |
| France | 24763 |
| Country | Deaths per total population (%) |
|---|---|
| Belgium | 0.067 |
| Spain | 0.0537 |
| Italy | 0.0475 |
| United Kingdom | 0.0416 |
| France | 0.038 |
| Country | Deaths per confirmed cases (%) |
|---|---|
| Belgium | 15.5593 |
| United Kingdom | 15.4751 |
| France | 14.7052 |
| Italy | 13.7153 |
| Netherlands | 12.3315 |
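
The percentage tables above are simple ratios of the raw columns. As a minimal sketch, assuming the column names printed by `str(df)` and using only the first ten rows shown there (the data frame name `toy` and the `_pct` column names are hypothetical), the rates could be computed like this:

```r
# First ten rows of three columns, copied from the str(df) output above.
toy <- data.frame(
  infected  = c(6801, 15597, 9455, 16705, 49906, 97100, 56714, 18435, 82877, 7285),
  recovered = c(5814, 13180, 177, 3117, 12211, 40937, 23814, 9572, 78586, 1666),
  deaths    = c(94, 596, 175, 97, 7765, 6761, 3684, 247, 4637, 324)
)

# Recoveries and deaths as a percentage of confirmed cases.
toy$recovered_pct <- round(100 * toy$recovered / toy$infected, 4)
toy$deaths_pct    <- round(100 * toy$deaths / toy$infected, 4)

# Rank the rows by death rate, highest first.
toy[order(-toy$deaths_pct), c("infected", "deaths", "deaths_pct")]
```

Row 5 yields a death rate of 15.5593 and row 9 a recovery rate of 94.8224, the same values that appear in the tables for Belgium and China.
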
| High | Upper/Middle | Lower/Middle |
|---|---|---|
| 28 | 15 | 6 |
| High | Upper/Middle | Lower/Middle |
|---|---|---|
| 2616591 | 695378 | 67079 |
| Country | Number of cases | Income |
|---|---|---|
| US | 1165868 | high |
| Spain | 247122 | high |
| Italy | 209328 | high |
| United Kingdom | 182260 | high |
| France | 168396 | high |
| High | Upper/Middle | Lower/Middle |
|---|---|---|
| 706001 | 316200 | 10803 |
| Country | Number of recoveries | Income |
|---|---|---|
| US | 175382 | high |
| Germany | 129000 | high |
| Spain | 117248 | high |
| Italy | 79914 | high |
| China | 78586 | upper/middle |
| High | Upper/Middle | Lower/Middle |
|---|---|---|
| 208230 | 28677 | 2743 |
| Country | Number of deaths | Income |
|---|---|---|
| US | 66369 | high |
| Italy | 28710 | high |
| United Kingdom | 28205 | high |
| Spain | 25100 | high |
| France | 24763 | high |
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.78283850 42.031539 42.03154
## Dim.2 1.87023532 20.780392 62.81193
## Dim.3 1.06470437 11.830049 74.64198
## Dim.4 0.97301185 10.811243 85.45322
## Dim.5 0.65224698 7.247189 92.70041
## Dim.6 0.35220916 3.913435 96.61385
## Dim.7 0.17428476 1.936497 98.55034
## Dim.8 0.11179442 1.242160 99.79250
## Dim.9 0.01867464 0.207496 100.00000
Key results: cumulative variance, eigenvalues, bar plot. In these results, the first three principal components have eigenvalues greater than 1, and together they explain 74.64% of the variation in the data; adding the fourth component (eigenvalue 0.97) brings the total to 85.45%. The bar plot shows that the eigenvalues start to form a straight line after the fourth principal component. To keep the results interpretable, however, we will use only the first two components, which together represent 62.81% of the variation in the data.
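
The variance percentages and the eigenvalue-greater-than-1 rule can be checked directly from the eigenvalues. The sketch below copies the eigenvalues from the output above; the variable names are illustrative only:

```r
# Eigenvalues copied from the PCA output above.
eig <- c(3.78283850, 1.87023532, 1.06470437, 0.97301185, 0.65224698,
         0.35220916, 0.17428476, 0.11179442, 0.01867464)

# With standardized variables the eigenvalues sum to the number of variables (9).
variance_pct   <- 100 * eig / sum(eig)
cumulative_pct <- cumsum(variance_pct)

# Kaiser criterion: count the components with an eigenvalue above 1.
n_kaiser <- sum(eig > 1)

round(data.frame(eig, variance_pct, cumulative_pct), 2)
n_kaiser
```
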
## $quanti
## correlation p.value
## infected 0.9455062 1.525080e-24
## deaths 0.9230268 3.976206e-21
## FPY..flights.per.year 0.8813770 6.442516e-17
## recovered 0.8503451 1.061037e-14
## GPP...global.pandemic.preparedness 0.5802849 1.240577e-05
## median.age 0.3624675 1.048339e-02
##
## $quali
## R2 p.value
## clusters 0.8368592 7.743584e-19
##
## $category
## Estimate p.value
## clusters=clusters_1 7.2162854 1.527916e-13
## ds=high 0.8859782 4.086720e-02
## clusters=clusters_2 -2.6437500 2.255358e-02
## clusters=clusters_3 -4.5725354 4.329677e-06
##
## attr(,"class")
## [1] "condes" "list "
The first principal component is strongly correlated with four of the original variables: it increases with infected cases, deaths, recoveries and flights per year. This suggests that these four variables vary together; if one increases, the others tend to increase as well. This component can therefore be viewed as a measure of the size of the outbreak together with air traffic, while global.pandemic.preparedness is only moderately correlated with it (0.58). The first principal component correlates most strongly with infected cases: with a correlation of 0.945, it is primarily a measure of the number of infected cases. It would follow that countries with high values tend to have a lot of human contact (no quarantine, no state of emergency, etc.), whereas countries with small values would have taken early precautions and respected quarantine.
Furthermore, flights per year is a global index giving the number of flights of each country per year. This variable is highly correlated with the first component (0.881), as is the number of infected cases (0.945), which means that when flights per year increases, the number of infected cases tends to increase too.
In addition, deaths are more strongly correlated with infected cases than recoveries are, even though recoveries outnumber deaths. This may be related to how quickly deaths followed infection: at first the world could do little for infected people beyond putting them in care and giving them painkillers, with no treatment, but after a couple of weeks the pandemic was better understood and temporary treatments were found.
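
The correlations reported for a component are the correlations between each standardized variable and the component scores. As a hedged sketch, using base R's `prcomp` and the built-in `USArrests` data as a stand-in (the report's own PCA function and data frame are not shown, so both are assumptions here), the same quantity can be obtained in two equivalent ways:

```r
# USArrests ships with base R; it stands in for the COVID-19 table here.
pca <- prcomp(USArrests, scale. = TRUE)

# Correlation of each original variable with the first component scores.
cor_dim1 <- as.numeric(cor(USArrests, pca$x[, 1]))

# Same quantity from the PCA itself: loading times component standard deviation.
coord_dim1 <- as.numeric(pca$rotation[, 1] * pca$sdev[1])

data.frame(variable = colnames(USArrests), cor_dim1, coord_dim1)
```

The two columns agree, which is why the variable coordinates listed later in the report match the correlations shown here.
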
## $quanti
## correlation p.value
## median.age 0.8617675 1.873946e-15
## beds...1000 0.7746392 6.538558e-11
## GPP...global.pandemic.preparedness 0.4086202 3.557739e-03
## Population..millions. -0.4962679 2.873127e-04
##
## $quali
## R2 p.value
## ds 0.2183344 0.003463241
##
## $category
## Estimate p.value
## ds=high 0.914282 0.001730232
## ds=lower/middle -0.844556 0.017074167
##
## attr(,"class")
## [1] "condes" "list "
The second principal component increases mainly with two of the variables, median.age and beds...1000, and decreases with Population (-0.50). This component can be viewed as a measure of how healthy the location is in terms of available hospital beds and the median age of the population. Furthermore, as stated in the variable description, the GPP represents, on a scale of 100, how prepared a country is if a pandemic gets out of control. Unfortunately, this variable is only weakly correlated with the others, which suggests that even the highly prepared, top-ranked countries were not prepared enough for the COVID-19 pandemic; whatever the preparedness score, more precautions should be taken.
## Too few points to calculate an ellipse
As the graph above shows, the US forms a group by itself, and the other two groups overlap somewhat, probably because they have very close numbers of infected cases.
Here, we used k-means to cluster the variables. This graph confirms our interpretation above: infected cases, deaths, recoveries and flights per year are highly correlated and fall in the same cluster, which is logical; the same goes for beds per 1000, median age and GPP. Density and population count are also grouped together, which is expected since density is calculated from the population count in the first place. The correlation circle tells a different story, however: these last two variables are negatively correlated, and density is poorly represented on this plane.
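
Clustering the variables with k-means amounts to running k-means on their coordinates on the retained components. The following is only a sketch under assumptions: it uses `prcomp` and the built-in `USArrests` data (four variables, so two clusters) rather than the report's data, and the object names are hypothetical:

```r
set.seed(123)  # k-means starts from random centers

pca <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates on the first two components (loading times sdev).
coords <- sweep(pca$rotation[, 1:2], 2, pca$sdev[1:2], "*")

# Group the variables by their position on the correlation circle.
km <- kmeans(coords, centers = 2, nstart = 25)

data.frame(variable = rownames(coords), cluster = km$cluster)
```
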
## Dim.1 Dim.2 Dim.3
## infected 0.94550621 -0.15270460 0.06201992
## Population..millions. 0.22784880 -0.49626791 -0.39805463
## Density..Km. -0.08330142 0.08300299 0.74719579
## median.age 0.36246752 0.86176749 -0.04798405
## beds...1000 0.09999117 0.77463924 -0.46194778
## GPP...global.pandemic.preparedness 0.58028489 0.40862020 0.32171733
## FPY..flights.per.year 0.88137704 -0.27118185 -0.05511578
## recovered 0.85034508 -0.08794865 -0.09747378
## deaths 0.92302684 -0.05283089 0.11121919
## Dim.4 Dim.5
## infected -0.08358150 -0.19939353
## Population..millions. 0.63119902 0.36268821
## Density..Km. 0.62758265 -0.17823251
## median.age 0.19137179 0.10429020
## beds...1000 0.24777144 -0.22882491
## GPP...global.pandemic.preparedness -0.19354036 0.57542262
## FPY..flights.per.year 0.12535300 0.02682055
## recovered 0.03524162 -0.19743798
## deaths -0.14603677 -0.12300428
On component 1, the four variables with the highest coordinates are infected (0.945), deaths (0.923), FPY..flights.per.year (0.881) and recovered (0.850); recall that these variables are highly correlated with component 1.
On component 2, the variables with the highest positive coordinates are median.age (0.862) and beds...1000 (0.775), while GPP...global.pandemic.preparedness has a medium-to-low coordinate (0.409); recall that the first two variables are highly correlated with component 2.
## Dim.1 Dim.2 Dim.3
## infected 0.893982000 0.023318696 0.003846471
## Population..millions. 0.051915074 0.246281834 0.158447491
## Density..Km. 0.006939126 0.006889496 0.558301550
## median.age 0.131382705 0.742643211 0.002302469
## beds...1000 0.009998235 0.600065955 0.213395748
## GPP...global.pandemic.preparedness 0.336730558 0.166970471 0.103502041
## FPY..flights.per.year 0.776825494 0.073539593 0.003037749
## recovered 0.723086756 0.007734965 0.009501138
## deaths 0.851978555 0.002791103 0.012369709
## Dim.4 Dim.5
## infected 0.006985866 0.0397577811
## Population..millions. 0.398412207 0.1315427406
## Density..Km. 0.393859979 0.0317668275
## median.age 0.036623161 0.0108764448
## beds...1000 0.061390685 0.0523608389
## GPP...global.pandemic.preparedness 0.037457870 0.3311111941
## FPY..flights.per.year 0.015713375 0.0007193416
## recovered 0.001241972 0.0389817574
## deaths 0.021326737 0.0151300539
On component 1, the four variables with the highest quality of representation are infected (0.894), deaths (0.852), FPY..flights.per.year (0.777) and recovered (0.723); recall that these variables are highly correlated with component 1.
On component 2, the variables with the highest quality of representation are median.age (0.743) and beds...1000 (0.600), while GPP...global.pandemic.preparedness has a low quality of representation (0.167); recall that the first two variables are highly correlated with component 2.
## Dim.1 Dim.2 Dim.3 Dim.4
## infected 23.6325711 1.2468322 0.3612713 0.7179631
## Population..millions. 1.3723841 13.1684944 14.8818297 40.9462851
## Density..Km. 0.1834370 0.3683759 52.4372368 40.4784359
## median.age 3.4731249 39.7085437 0.2162543 3.7638967
## beds...1000 0.2643051 32.0850509 20.0427231 6.3093461
## GPP...global.pandemic.preparedness 8.9015314 8.9277787 9.7212000 3.8496828
## FPY..flights.per.year 20.5355183 3.9321037 0.2853139 1.6149212
## recovered 19.1149254 0.4135825 0.8923734 0.1276420
## deaths 22.5222027 0.1492381 1.1617975 2.1918270
## Dim.5
## infected 6.0955102
## Population..millions. 20.1676274
## Density..Km. 4.8703679
## median.age 1.6675347
## beds...1000 8.0277626
## GPP...global.pandemic.preparedness 50.7646956
## FPY..flights.per.year 0.1102867
## recovered 5.9765332
## deaths 2.3196817
On component 1, the four variables with the highest contributions are infected (23.63), deaths (22.52), FPY..flights.per.year (20.54) and recovered (19.11); recall that these variables are highly correlated with component 1.
On component 2, the variables with the highest contributions are median.age (39.71) and beds...1000 (32.09), while GPP...global.pandemic.preparedness has a low contribution (8.93); recall that the first two variables are highly correlated with component 2.
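
The three tables above are linked: the quality of representation is the squared coordinate, and the contribution is that squared coordinate divided by the component's eigenvalue, in percent. A short check on Dim.1, with values copied from the output above (the shortened variable labels are only for illustration):

```r
# Dim.1 coordinates and eigenvalue copied from the output above.
coord_dim1 <- c(infected = 0.94550621, Population = 0.22784880,
                Density = -0.08330142, median.age = 0.36246752,
                beds = 0.09999117, GPP = 0.58028489,
                FPY = 0.88137704, recovered = 0.85034508,
                deaths = 0.92302684)
eig_dim1 <- 3.78283850

cos2_dim1    <- coord_dim1^2                # quality of representation
contrib_dim1 <- 100 * cos2_dim1 / eig_dim1  # contribution in percent

round(cbind(coord_dim1, cos2_dim1, contrib_dim1), 4)
```

Because the squared coordinates of a component sum to its eigenvalue, the contributions of the nine variables add up to 100.
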
Here we can see that the countries with a high quality of representation on the first component are mostly the medium- and high-income countries. Going back to the analysis of the first dimension, we said that the infected variable is the most representative of it and linked it to flights per year. This makes sense: a country with medium or high income normally has medium or large airports, and with them a medium to high number of flights per year. The same goes for the second dimension: the more income a country has, the more health care it can provide for its citizens.
To start our classification and segmentation, we first need to choose an algorithm and method that work well with our dataset. We will use the clValid package to identify the best clustering approach and the optimal number of clusters, comparing k-means, hierarchical and PAM clustering.
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
##
## Validation Measures:
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
##
## hierarchical Connectivity 5.8579 9.7159 15.7774 23.3472 25.3472 27.2139 29.6806 34.6187 35.9270 40.8806 43.1306 50.6147 52.4480 62.4246 65.4742 67.4742 69.4504 74.3940 79.6794 80.6972 83.5234 86.4274
## Dunn 0.9407 0.6527 0.3284 0.2399 0.2399 0.2399 0.2399 0.2569 0.2569 0.2569 0.2569 0.2569 0.2569 0.3082 0.3082 0.3082 0.3082 0.3313 0.4136 0.4136 0.4136 0.4136
## Silhouette 0.5780 0.5108 0.3796 0.3335 0.3163 0.2712 0.2654 0.3428 0.3356 0.2981 0.2930 0.2953 0.2732 0.2564 0.2361 0.2232 0.2072 0.2147 0.2329 0.2252 0.2161 0.2049
## kmeans Connectivity 5.8579 9.7159 14.9544 19.8425 21.8425 29.0917 31.5095 34.2929 37.2040 41.5643 42.8976 46.5369 55.0206 58.5651 64.9988 66.9988 68.9750 76.1782 80.3810 81.3988 85.1417 87.7345
## Dunn 0.9407 0.6527 0.1241 0.1518 0.1518 0.2211 0.2044 0.2236 0.2367 0.3281 0.3281 0.3281 0.3205 0.3201 0.3201 0.3201 0.3201 0.4051 0.4254 0.4415 0.4528 0.4944
## Silhouette 0.5780 0.5108 0.3527 0.3861 0.3708 0.3841 0.3620 0.3689 0.3640 0.3393 0.3283 0.3232 0.2810 0.2804 0.2740 0.2610 0.2508 0.2226 0.2312 0.2258 0.2076 0.2002
## pam Connectivity 11.0655 14.5123 16.1496 24.1000 25.4290 35.6734 37.6734 44.7210 47.9127 49.2210 58.5310 59.4810 60.9155 63.1655 69.1079 70.9413 72.9413 79.6714 82.4976 84.4738 85.7167 87.4250
## Dunn 0.1020 0.1020 0.1489 0.0916 0.2035 0.1914 0.2079 0.2079 0.2121 0.2121 0.1920 0.2434 0.2594 0.2733 0.2733 0.3300 0.3300 0.3613 0.3613 0.3955 0.4222 0.4944
## Silhouette 0.3051 0.3305 0.3511 0.3998 0.4010 0.2945 0.2800 0.2584 0.2621 0.2739 0.2609 0.2573 0.2734 0.2710 0.2506 0.2539 0.2409 0.2338 0.2247 0.2220 0.2184 0.1984
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 5.8579 hierarchical 3
## Dunn 0.9407 hierarchical 3
## Silhouette 0.5780 hierarchical 3
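Connectivity measures how often neighbouring observations are placed in the same cluster (lower is better), and the Silhouette width measures how well each observation fits its own cluster compared with the nearest other cluster (higher is better). The Dunn index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance (higher is better). All three measures are optimal at 3 clusters, where hierarchical clustering and k-means obtain identical scores, so for now we will use 3 clusters and the 'hierarchical' method.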
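
To make the Dunn index definition concrete, here is a small base-R sketch on toy data. The helper `dunn_index` is hypothetical and written for illustration; it is not the clValid implementation:

```r
# Dunn index: smallest between-cluster distance over largest within-cluster distance.
dunn_index <- function(x, cluster) {
  d    <- as.matrix(dist(x))
  same <- outer(cluster, cluster, "==")  # TRUE where two points share a cluster
  min(d[!same]) / max(d[same])
}

# Toy data: two tight, well-separated groups on a line.
x       <- matrix(c(0, 1, 10, 11), ncol = 1)
cluster <- c(1, 1, 2, 2)

dunn_index(x, cluster)  # 9: closest gap is 10 - 1 = 9, widest cluster spans 1
```
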
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 3 as the best number of clusters
## * 5 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 3 proposed 8 as the best number of clusters
## * 3 proposed 9 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 5 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 3 proposed 8 as the best number of clusters
## * 3 proposed 9 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 3 .
The bar chart above shows that the majority of criteria propose 3 clusters.
The dendrogram above tells the same story as the results from the clValid package and the PCA individuals plot clustered by k-means: three major clusters are present. Relating the classification results to the PCA results, the blue cluster (right group) contains the lower-middle- and low-income countries; these countries have the lowest recorded numbers of infected cases because, as we said before, they do not have a high number of flights. The green cluster (middle group) contains the upper-middle- and high-income countries, with a medium to high number of flights per year. The last cluster, in red, is a single individual, which is expected: the US has the highest number of flights per year and is now recording the highest number of infected cases, accounting by itself for about 35% of the infected cases in this dataset (1165868 of 3379048).
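
A dendrogram cut into three groups, as described above, can be produced with base R alone. This is a sketch under assumptions: the built-in `USArrests` data stands in for the country table, and the Ward linkage is a choice made here, since the linkage used in the report is not shown:

```r
# USArrests stands in for the country table; variables are standardized first.
x  <- scale(USArrests)

# Agglomerative clustering on Euclidean distances (linkage is an assumption).
hc <- hclust(dist(x), method = "ward.D2")

# Cut the tree into the three groups suggested by the validation measures.
groups <- cutree(hc, k = 3)
table(groups)

# plot(hc); rect.hclust(hc, k = 3) would draw the dendrogram with the groups boxed.
```
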
In this part we classify the countries by a risk indicator called 'state', taking the values 0 and 1 for low and high risk respectively, which we create as follows:
if total deaths are more than 3% of total infected cases, the country is assigned class 1, the highest, meaning a very serious condition;
if total deaths are less than 3% of total infected cases, the country is assigned class 0, the lowest, meaning the general situation is under control.
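
The rule above can be sketched in one line of R. The example uses the first ten rows printed by `str(df)`; the data frame name `toy` is hypothetical:

```r
# First ten rows of infected and deaths, copied from the str(df) output above.
toy <- data.frame(
  infected = c(6801, 15597, 9455, 16705, 49906, 97100, 56714, 18435, 82877, 7285),
  deaths   = c(94, 596, 175, 97, 7765, 6761, 3684, 247, 4637, 324)
)

# Class 1 (high risk) when deaths exceed 3% of infected cases, otherwise class 0.
toy$state <- ifelse(toy$deaths / toy$infected > 0.03, 1, 0)

table(toy$state)
```

On these ten rows the rule marks six countries as high risk and four as low risk.
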
Here we can see that 30 countries are at high risk, seriously endangered by this pandemic, and 19 countries have it under control. In this section we will not use a random forest, because we do not have enough individuals to grow a forest of distinct trees (the default in R is 500 trees); instead we will use a decision tree:
The results here are very interesting:
The tree structure is ideal for capturing interactions between features in the data. The tree tells us that if a country has fewer than 302 deaths and a median age under 41, the risk indicator is 'LOW', which means the country should be able to control the situation; if its median age is above 41, it has a 6% chance of losing control. This suggests that in countries with an older population, infected people have a greater risk of dying.
Turning to the right side of the tree: if a country has more than 302 deaths, we consider another indicator, the global pandemic preparedness. If it is above 54, there is a high chance that the country is at high risk from this pandemic and its risk indicator is '1' (HIGH), which does not seem logical.
If a country is prepared for a pandemic, why would it be at risk?
This confirms the results from the PCA. This seemingly illogical result means two things:
first, this pandemic was beyond expectations, which is a fact;
second, the countries that thought they were prepared, did not follow the instructions for lockdown and quarantine, and relied on their general health care indicators (beds per 1000, global pandemic preparedness, rehabilitation beds) took a hard knock and recorded the highest numbers of infected cases; the US is the best example of this.
Furthermore, if a country has a GPP under 54 and more than 4682 recoveries, it is partially okay but still endangered; if it has fewer than 4682 recoveries, it is definitely at risk.
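
For readers who want to reproduce this kind of tree, here is a minimal sketch with the `rpart` package. Everything in it is an assumption for illustration: the data are synthetic, the label depends only on the death count, and the report's actual call and settings are not shown:

```r
library(rpart)  # recommended package shipped with standard R installations

# Synthetic stand-in data: the label depends only on the death count.
toy <- data.frame(
  deaths     = seq(10, 1000, length.out = 60),
  median.age = rep(c(30, 45), times = 30)
)
toy$state <- factor(ifelse(toy$deaths > 300, "HIGH", "LOW"))

# Classification tree and its predictions on the training data.
fit  <- rpart(state ~ deaths + median.age, data = toy, method = "class")
pred <- predict(fit, newdata = toy, type = "class")

print(fit)  # text view of the splits
table(ORIGINAL = toy$state, PREDICTED = pred)
```
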
## PREDICTED
## ORIGINAL HIGH LOW
## HIGH 26 4
## LOW 0 19
As we can see above, the confusion matrix shows that only 4 individuals are misclassified.
## [1] "THE PRECISION OF THE CORRECT PLACED PREDICTIONS IS :"
## [1] 0.9183673
## [1] "THE PRECISION OF THE ERRORS IS :"
## [1] 0.08163265
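Here we have a high accuracy (45 of 49 countries correctly classified), which is very good, keeping in mind that it is measured on the same 49 countries used to build the tree.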
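
The value printed above is the overall accuracy, that is, the diagonal of the confusion matrix divided by its total. The sketch below recomputes it from the matrix shown above and adds the class-wise precision and recall for the HIGH class, which are different quantities (object names are illustrative):

```r
# Confusion matrix copied from the output above (rows: original, columns: predicted).
cm <- matrix(c(26, 4,
                0, 19),
             nrow = 2, byrow = TRUE,
             dimnames = list(ORIGINAL  = c("HIGH", "LOW"),
                             PREDICTED = c("HIGH", "LOW")))

accuracy   <- sum(diag(cm)) / sum(cm)  # share of correctly classified countries
error_rate <- 1 - accuracy

# Class-wise measures for the HIGH class.
precision_high <- cm["HIGH", "HIGH"] / sum(cm[, "HIGH"])
recall_high    <- cm["HIGH", "HIGH"] / sum(cm["HIGH", ])

c(accuracy = accuracy, error = error_rate,
  precision_high = precision_high, recall_high = recall_high)
```
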
In this part we fit a logistic regression, using automatic backward and forward selection through the 'both' direction argument of the stepAIC function. We also modify our state index slightly: we divide the total active cases by the total infected cases and round the result, so that it is either 1 (high risk) or 0 (low risk). We also remove the infected column to avoid overfitting.
Here are the results:
## Start: AIC=60.63
## state ~ 1
##
## Df Deviance AIC
## + recovered 1 53.213 57.213
## + GPP...global.pandemic.preparedness 1 56.354 60.354
## <none> 58.630 60.630
## + median.age 1 56.999 60.999
## + Density..Km. 1 57.427 61.427
## + ds 1 57.966 61.966
## + Population..millions. 1 58.444 62.444
## + deaths 1 58.560 62.560
## + beds...1000 1 58.599 62.599
## + FPY..flights.per.year 1 58.605 62.605
##
## Step: AIC=57.21
## state ~ recovered
##
## Df Deviance AIC
## + deaths 1 45.445 51.445
## + FPY..flights.per.year 1 49.233 55.233
## <none> 53.213 57.213
## + GPP...global.pandemic.preparedness 1 52.442 58.442
## + Density..Km. 1 52.523 58.523
## + median.age 1 52.719 58.719
## + ds 1 52.747 58.747
## + beds...1000 1 53.093 59.093
## + Population..millions. 1 53.196 59.196
## - recovered 1 58.630 60.630
##
## Step: AIC=51.45
## state ~ recovered + deaths
##
## Df Deviance AIC
## + GPP...global.pandemic.preparedness 1 40.582 48.582
## + ds 1 41.626 49.626
## <none> 45.445 51.445
## + median.age 1 44.174 52.174
## + Population..millions. 1 44.393 52.393
## + FPY..flights.per.year 1 44.548 52.548
## + Density..Km. 1 44.981 52.981
## + beds...1000 1 45.337 53.337
## - deaths 1 53.213 57.213
## - recovered 1 58.560 62.560
##
## Step: AIC=48.58
## state ~ recovered + deaths + GPP...global.pandemic.preparedness
##
## Df Deviance AIC
## + FPY..flights.per.year 1 38.547 48.547
## <none> 40.582 48.582
## + ds 1 39.226 49.226
## + Population..millions. 1 40.153 50.153
## + Density..Km. 1 40.205 50.205
## + beds...1000 1 40.379 50.379
## + median.age 1 40.547 50.547
## - GPP...global.pandemic.preparedness 1 45.445 51.445
## - deaths 1 52.442 58.442
## - recovered 1 56.040 62.040
##
## Step: AIC=48.55
## state ~ recovered + deaths + GPP...global.pandemic.preparedness +
## FPY..flights.per.year
##
## Df Deviance AIC
## <none> 38.547 48.547
## - FPY..flights.per.year 1 40.582 48.582
## + ds 1 37.422 49.422
## + Density..Km. 1 38.185 50.185
## + beds...1000 1 38.301 50.301
## + median.age 1 38.514 50.514
## + Population..millions. 1 38.533 50.533
## - GPP...global.pandemic.preparedness 1 44.548 52.548
## - deaths 1 47.035 55.035
## - recovered 1 56.031 64.031
This is the final model selected by the function:
##
## Call:
## glm(formula = state ~ recovered + deaths + GPP...global.pandemic.preparedness +
## FPY..flights.per.year, family = binomial, data = (newdf))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8705 -0.1346 0.3453 0.6051 2.2430
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.949e+00 2.613e+00 2.660 0.00782 **
## recovered -9.497e-05 3.510e-05 -2.706 0.00682 **
## deaths 2.168e-04 9.524e-05 2.276 0.02283 *
## GPP...global.pandemic.preparedness -9.647e-02 4.370e-02 -2.208 0.02727 *
## FPY..flights.per.year 6.266e-07 5.219e-07 1.201 0.22991
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 58.630 on 48 degrees of freedom
## Residual deviance: 38.547 on 44 degrees of freedom
## AIC: 48.547
##
## Number of Fisher Scoring iterations: 6
As we can see from the results above, three of the four predictors (recovered, deaths and GPP) are significant at the 5% level, while flights per year is not (p = 0.23).
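
The stepwise search traced above can be sketched on synthetic data. Note the assumptions: this uses `step()` from base R's stats package instead of `MASS::stepAIC`, and the variables `x1`, `x2` and `y` are invented for the example:

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)                        # informative predictor
x2 <- rnorm(n)                        # pure noise
y  <- rbinom(n, 1, plogis(1.5 * x1))  # binary outcome driven by x1 only
toy <- data.frame(y, x1, x2)

# Start from the intercept-only model and allow both additions and removals.
null_model <- glm(y ~ 1, family = binomial, data = toy)
selected <- step(null_model,
                 scope = list(lower = ~ 1, upper = ~ x1 + x2),
                 direction = "both", trace = 0)

formula(selected)

# For 0/1 responses, AIC = residual deviance + 2 * number of coefficients.
all.equal(AIC(selected), deviance(selected) + 2 * length(coef(selected)))
```

The same identity holds in the trace above: the final model has a residual deviance of 38.547 and five coefficients, giving an AIC of 38.547 + 10 = 48.547.
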
In this section we look at the accuracy of the selected model's predictions:
## [1] "THE ACCURACY OF THE CORRECT PLACED PREDICTIONS IS :"
## [1] 0.8461538
## [1] "THE ACCURACY OF THE ERRORS IS :"
## [1] 0.1538462
We have a good accuracy!
The number one factor in the evolution of infected cases is the flights-per-year indicator.
Evidence: the USA has the highest number of flights per year and the highest number of infected cases.
The GPP (global pandemic preparedness) indicator does not explain the percentage of recoveries.
The general health indicators (see the PCA results) did not explain whether a country could be effective against the virus.
The number of deaths increases where a country has a median age above 41.